Indexing Weighted Sequences: Neat and Efficient
Authors
Abstract
In a weighted sequence, for every position of the sequence and every letter of the alphabet, a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example, in molecular biology, where they are known as Position-Weight Matrices. Given a probability threshold 1/z, we say that a string P of length m occurs in a weighted sequence X at position i if the product of probabilities of the letters of P at positions i, ..., i+m-1 in X is at least 1/z. In this article, we consider an indexing variant of the problem, in which we are to preprocess a weighted sequence to answer multiple pattern matching queries. We present an O(nz)-time construction of an O(nz)-sized index for a weighted sequence of length n over a constant-sized alphabet that answers pattern matching queries in optimal O(m+Occ) time, where Occ is the number of occurrences reported. The cornerstone of our data structure is a novel construction of a family of ⌊z⌋ special strings that carries the information about all the strings that occur in the weighted sequence with a sufficient probability. We obtain a weighted index with the same complexities as the most efficient previously known index by Barton et al. [3], but our construction is significantly simpler. The most complex algorithmic tool required in the basic form of our index is the suffix tree, which we use to develop a new, more straightforward index for the so-called property matching problem. We provide an implementation of our data structure. Our construction also allows us to obtain a significant improvement over the complexities of the approximate variant of the weighted index presented by Biswas et al. [6] and an improvement of the space complexity of their general index.
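The occurrence definition in the abstract can be illustrated with a naive, index-free check. This is only a sketch of the definition, not the O(nz) index the paper constructs: the function name `occurrences`, the list-of-dictionaries representation of the weighted sequence, and the 0-based positions are all assumptions made for this illustration.

```python
def occurrences(X, P, z):
    """Return all 0-based positions where P occurs in the weighted sequence X.

    X is assumed to be a list of dicts mapping each letter to its
    probability at that position (letters that are absent have
    probability 0). P occurs at position i when the product of the
    probabilities of its letters at positions i, ..., i+m-1 is at
    least 1/z.
    """
    threshold = 1.0 / z
    m = len(P)
    result = []
    for i in range(len(X) - m + 1):
        prob = 1.0
        for j, letter in enumerate(P):
            prob *= X[i + j].get(letter, 0.0)
            # Probabilities never exceed 1, so the product can only shrink;
            # once it drops below the threshold this position is ruled out.
            if prob < threshold:
                break
        else:
            # The inner loop finished without a break: the product is >= 1/z.
            result.append(i)
    return result


# A toy weighted sequence of length 3 over the alphabet {a, b}.
X = [
    {"a": 0.5, "b": 0.5},
    {"a": 1.0},
    {"a": 0.25, "b": 0.75},
]

print(occurrences(X, "aa", 4))
print(occurrences(X, "ab", 4))
```

This scan takes O(nm) time per query, which is the cost that preprocessing the weighted sequence into an index is meant to avoid when many patterns are queried.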
Journal:
- CoRR
Volume: abs/1704.07625
Publication date: 2017